Introduction to Data Science in Python by Appsilon

Course introduction

Piotr Pasza Storożenko@Appsilon

Introduction

About me

Piotr Pasza Storożenko, Machine Learning Engineer

A bit:

  • ML guy
  • Physists
  • Computer scientist
  • Mathematician

you can find more on my blog: pstorozenko.github.io.

Why this course?

Through the years I got a lot of knowledge about python, R, julia, data science, machine learning, deep learningu, softer developemnt.

Now I can share it with you!

What will be interesting for whom during this course?

  • Computer scientists
    • Easy to use and efficient tools to work with data
    • Made by software developers
  • Mathematicians
    • Intuitive tools that support reporoducible experiments
    • Ridiculously easy work with plots
  • Electronicians, mechanicians
    • Great alternative to matlab
    • Easy to use
    • Simple plots and animations
  • Economic students
    • Great alternative for spreadsheets
    • Set of free, open source tools
    • Uncomparable greater control of data when compared to MS Office, Tableau, PowerBI etc.
  • Physists, chemicians, biologists
    • Substential relief from spreadsheets
    • Open-source software
    • Much easier to create, reproducible plots

Data Science

What’s Sata Science?

Why Data Science?

Explanation of various terms

  • Artificial Inteligence, AI – an umbrella term connceted to everything where the computer/system makes decisions based on a set of rules, on an algorithm.
  • Data Science (DS) – everything related to data, from collecting, through processing, up to displaying and using
  • Machine Learning, ML – everyting related to creating/training models that are able to learn rules based on provided data
  • [Deep] Neural Networks, [D]NN – subset of ML methods based on special class of models, so called neural networks. Their architecture (design) somehow resembles connections between biological neurons, hence the name.

Who’s a Data Scientist?

Someone who simuntinesly:

  1. Discusses with so called bussiness the topic of required solutions.
  2. Based on provided and collected data creates solutions using programming skills.
  3. Delivers the results to the bussiness in a clear and interesing way.

Bussiness talks about AI, experts preffer to say ML

Data Science Workflow

Data Science Tools

Course plan

  1. Introduction and numpy - working numbers
  2. pandas - working with data frames
  3. matplotlib and plotly - plotting data
  4. scikit-learn - introduction to machine learning
  5. streamlit, quarto, fastapi - sharing your work

Course plan

Data Science Workflow x This course

Data Science Workflow x This course ++

Tools used in the course

Tools used in the course

  • Python 3.9/3.10 via Anaconda + many additional packages
  • Visual Studio Code aka VS Code

Anaconda

Anaconda is a standard when it comes to managing python environemnts in data science/machine learning community. It allows to obtain a consistent environment among various systems.

Why is it that important?

Data scienctists often work on many projects at the same time. Each project might require different environment, with specific versions of python and other libraries.

This might be also a relief when working on different projects during studying!

Anaconda - How to create an environment

After installing anaconda, you have to clone this course repo, and move yourself in terminal into course repo. Then run:

conda create -n appsilon-ds-course python=3.10 -y
conda activate appsilon-ds-course
pip install -r requirements.txt

If you received the following error message:

ERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'

you’re in a wrong directory.

In case of problems, check out the official conda tutorial.

VSCode - code editor for 2022

Why VS Code?

  • Creat support for both python scripts and jupyter notebooks.
  • Automatically detects conda environments
  • Great support for working with remote machines through SSH (althogh we will not use this feature)
  • One tool to work with python, R, julia, javascript, typescript etc.
  • Above this all – VS Code is free

What to do after installing VS Code?

Install extensions

Environment for work and studing is ready!

pip vs conda vs pipenv vs …

We need multiple environments on a single machine.

How to live, what to use?

NEVER PLAY WITH DATA SCIENCE ON YOUR DEFAULT SYSTEM’S PYTHON

pip + virtualenv

  1. A basic package manager included in python
  2. Works only for a single version of python
  3. Capable of installing python packages only
  4. Basic package versioning with pip freeze
  5. Pretty fast when doesn’t have to build packages

conda

  1. A package manager provided by Anaconda
  2. Allows for creating different environments for different major (3.9/3.10) and minor (3.10.3/3.10.4) python versions
  3. Is able to install other software than python packages as well (e.g. R or CUDA drivers)
  4. Basic package versioning with conda list --export
  5. Super slow for bigger environments
  6. Packages installed with conda can be shared across environments – lower disk usage (sole PyTorch is ~1.7GB)

pipenv

  1. Looks like pip + virtualenv plus different python versions
  2. Very big focus on environment reproducibility
  3. Super slow for bigger environments

How to live?

The most reliable setup for experimenting is to do:

conda create -n my-env python==3.10.4
conda activate my-env
pip install ...

If you need to install CUDA drivers then do it on environment creation conda create -n my-env python cudatoolkit.

After you install all packages, save in README file python version, e.g.,

Project created with python 3.10.4.

and store installed packages with pip freeze > requirements.txt.

Remember that not every package version is available for every python version. For example Tensorflow 2.10 is supported only in python>=3.10.